This file contains a worked example of persistent homology applied to real data, using a deliberately small data set. The Dow Jones Industrial Average is, roughly speaking, a scaled sum of the stock prices of 30 major publicly traded companies. Here we look at one period of US economic history, basing our stock choices roughly on the Dow in the run-up to the 2007-2009 financial crisis.
We can get stock data using several nice R libraries, such as BatchGetSymbols, which queries a stock data API of your choice (or the default) to fetch historical prices for a given list of tickers or symbols.
From Nov 21, 2005 to Feb 19, 2008, the Dow consisted of a steady set of stocks; on Feb 19, 2008, Altria and Honeywell were dropped and replaced with Bank of America and Chevron. Then on Sept 22, 2008, AIG was replaced by Kraft Foods, and on June 8, 2009, Citigroup and General Motors were dropped in favor of Cisco Systems and Travelers. Let’s first analyze the stable stretch covering 2006 and 2007 plus a little on either side, the run-up to the crisis.
library(BatchGetSymbols) # stock price downloads
library(reshape2)        # dcast, for long-to-wide reshaping
library(TDAstats)        # persistent homology via Ripser
library(TDA)             # persistent homology via GUDHI/PHAT
library(ggplot2)         # ggtitle, xlim, ylim on plot_persist output

first.date.finance <- as.Date('2005/11/22')
last.date.finance <- as.Date('2008/02/19')
tickers <- c('MMM','AA','MO','AXP','AIG','T','BA','CAT','C','KO','DWDP','XOM','GE','GM','HD','HP','HON','IBM','INTC','JNJ','JPM','MCD','MRK','MSFT','PFE','PG','UTX','VZ','WMT','DIS')
data.finance <- BatchGetSymbols(tickers = tickers,
first.date = first.date.finance,
last.date = last.date.finance)
##
## Running BatchGetSymbols for:
## tickers = MMM, AA, MO, AXP, AIG, T, BA, CAT, C, KO, DWDP, XOM, GE, GM, HD, HP, HON, IBM, INTC, JNJ, JPM, MCD, MRK, MSFT, PFE, PG, UTX, VZ, WMT, DIS
## Downloading data for benchmark ticker | Found cache file
## MMM | yahoo (1|30) | Found cache file - Youre doing good!
## AA | yahoo (2|30) | Found cache file - Feels good!
## MO | yahoo (3|30) | Found cache file - Nice!
## AXP | yahoo (4|30) | Found cache file - Good job!
## AIG | yahoo (5|30) | Found cache file - Looking good!
## T | yahoo (6|30) | Found cache file - Good stuff!
## BA | yahoo (7|30) | Found cache file - Good stuff!
## CAT | yahoo (8|30) | Found cache file - OK!
## C | yahoo (9|30) | Found cache file - Youre doing good!
## KO | yahoo (10|30) | Found cache file - Looking good!
## DWDP | yahoo (11|30) | Found cache file - You got it!
## XOM | yahoo (12|30) | Found cache file - OK!
## GE | yahoo (13|30) | Found cache file - Got it!
## GM | yahoo (14|30) | Found cache file - Error in download..
## HD | yahoo (15|30) | Found cache file - You got it!
## HP | yahoo (16|30) | Found cache file - Looking good!
## HON | yahoo (17|30) | Found cache file - You got it!
## IBM | yahoo (18|30) | Found cache file - OK!
## INTC | yahoo (19|30) | Found cache file - Well done!
## JNJ | yahoo (20|30) | Found cache file - Good stuff!
## JPM | yahoo (21|30) | Found cache file - You got it!
## MCD | yahoo (22|30) | Found cache file - Good stuff!
## MRK | yahoo (23|30) | Found cache file - OK!
## MSFT | yahoo (24|30) | Found cache file - Good stuff!
## PFE | yahoo (25|30) | Found cache file - Feels good!
## PG | yahoo (26|30) | Found cache file - Got it!
## UTX | yahoo (27|30) | Found cache file - Got it!
## VZ | yahoo (28|30) | Found cache file - You got it!
## WMT | yahoo (29|30) | Found cache file - OK!
## DIS | yahoo (30|30) | Found cache file - Nice!
widestock_finance <- dcast(data.finance$df.tickers[,6:8], ref.date ~ ticker, value.var="price.adjusted")
Let’s also look at today’s Dow, more or less.
first.date.today <- as.Date('2013/11/01')
last.date.today <- as.Date('2019/01/25')
tickers <- c('MMM','AAPL','AXP','BA','CAT','CVX','CSCO','KO','DWDP','XOM','GS','HD','IBM','INTC','JNJ','JPM','MCD','MRK','MSFT','NKE','PFE','PG','TRV','UNH','UTX','VZ','V','WMT','WBA','DIS')
data.today <- BatchGetSymbols(tickers = tickers,
first.date = first.date.today,
last.date = last.date.today)
##
## Running BatchGetSymbols for:
## tickers = MMM, AAPL, AXP, BA, CAT, CVX, CSCO, KO, DWDP, XOM, GS, HD, IBM, INTC, JNJ, JPM, MCD, MRK, MSFT, NKE, PFE, PG, TRV, UNH, UTX, VZ, V, WMT, WBA, DIS
## Downloading data for benchmark ticker | Found cache file
## MMM | yahoo (1|30) | Found cache file - Good stuff!
## AAPL | yahoo (2|30) | Found cache file - Nice!
## AXP | yahoo (3|30) | Found cache file - Looking good!
## BA | yahoo (4|30) | Found cache file - Nice!
## CAT | yahoo (5|30) | Found cache file - You got it!
## CVX | yahoo (6|30) | Found cache file - Looking good!
## CSCO | yahoo (7|30) | Found cache file - You got it!
## KO | yahoo (8|30) | Found cache file - OK!
## DWDP | yahoo (9|30) | Found cache file - Feels good!
## XOM | yahoo (10|30) | Found cache file - Looking good!
## GS | yahoo (11|30) | Found cache file - Well done!
## HD | yahoo (12|30) | Found cache file - OK!
## IBM | yahoo (13|30) | Found cache file - Nice!
## INTC | yahoo (14|30) | Found cache file - Got it!
## JNJ | yahoo (15|30) | Found cache file - Nice!
## JPM | yahoo (16|30) | Found cache file - Good job!
## MCD | yahoo (17|30) | Found cache file - Got it!
## MRK | yahoo (18|30) | Found cache file - Mas bah tche, que coisa linda!
## MSFT | yahoo (19|30) | Found cache file - You got it!
## NKE | yahoo (20|30) | Found cache file - Good stuff!
## PFE | yahoo (21|30) | Found cache file - OK!
## PG | yahoo (22|30) | Found cache file - You got it!
## TRV | yahoo (23|30) | Found cache file - Youre doing good!
## UNH | yahoo (24|30) | Found cache file - You got it!
## UTX | yahoo (25|30) | Found cache file - Youre doing good!
## VZ | yahoo (26|30) | Found cache file - Well done!
## V | yahoo (27|30) | Found cache file - Well done!
## WMT | yahoo (28|30) | Found cache file - Feels good!
## WBA | yahoo (29|30) | Found cache file - Got it!
## DIS | yahoo (30|30) | Found cache file - Mais contente que cusco de cozinheira!
widestock_today <- dcast(data.today$df.tickers[,6:8], ref.date ~ ticker, value.var="price.adjusted")
We consider log returns rather than prices, since log returns are closer to stationary.
logrets_finance <- apply(widestock_finance[,2:length(widestock_finance)], 2,
function(x) diff(log(x), lag=1))
logrets_today <- apply(widestock_today[,2:length(widestock_today)], 2,
function(x) diff(log(x), lag=1))
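As a quick illustration (with toy prices made up for this example), the log return log(P_t / P_{t-1}) computed by `diff(log(x), lag=1)` is close to the ordinary percent change for small daily moves:

```r
# Toy price series (hypothetical values) to illustrate log returns
prices <- c(100, 101, 99.5, 102)
logret <- diff(log(prices), lag = 1)        # same computation as above
simple <- diff(prices) / head(prices, -1)   # ordinary returns, for comparison
print(round(cbind(logret, simple), 4))      # nearly equal for small moves
```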
What shape is our data?
dim(logrets_finance)
## [1] 561 29
dim(logrets_today)
## [1] 1314 30
For some reason a single NA appears in several of the stock return series. Since it’s one observation out of many, for today’s purposes I’m going to replace those NAs with zero. (Note also that only 29 of the 30 requested tickers returned data for the 2005-2008 window, which is why logrets_finance has 29 columns.)
logrets_finance[is.na(logrets_finance)] <- 0
logrets_today[is.na(logrets_today)] <- 0
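Before zeroing NAs wholesale, it’s worth confirming there really are only a few; a quick audit (a sketch — run it before the replacement above) might look like:

```r
# Count missing values per stock to confirm it's really just a handful
na_per_stock <- colSums(is.na(logrets_finance))
print(na_per_stock[na_per_stock > 0])   # only the stocks with at least one NA
print(sum(is.na(logrets_today)))        # total NAs in the recent data
```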
Now let’s loop through our observations and split them into sixty-three-day chunks. Sixty-three days is about one quarter in a business year of 252 trading days.
Timing note: using the R package “TDA”, the Vietoris-Rips computation via GUDHI or PHAT takes quite a long time, so looking at quarters is easier for the impatient researcher. However, some research indicates that 15- or 20-day increments are preferable for analyzing network correlation – and with Ripser as the backend engine, via the “TDAstats” package, that is feasible. (Twenty-one days gives about one month of business days.) We’ll look at correlations between the stocks over these increments.
(Here I need to thank Fiona Jiang and Ayman Ahmed for their initial code; I’ve modified it but they started this!)
time_window <- 63
num_of_finance_days <- nrow(logrets_finance)
num_time_points_fin <- floor(num_of_finance_days / time_window)
num_of_today_days <- nrow(logrets_today)
num_time_points_today <- floor(num_of_today_days / time_window)
num_stocks_fin <- ncol(logrets_finance)
num_stocks_today <- ncol(logrets_today)
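A quick sanity check on the bookkeeping (the expected values follow from the dimensions printed above):

```r
# 561 finance-era return days and 1314 recent days, in 63-day windows:
print(c(num_time_points_fin, num_time_points_today))  # 8 and 20 full windows
# The split on (row - 1) %/% time_window below also keeps a final partial
# window, which is why 9 and 21 diagrams appear later.
```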
Then we can make correlation matrices for each of these time windows. I’ll split each dataframe into pieces the length of our time window and then apply correlation and a shift to make it a distance matrix.
logrets_today.split <- split(as.data.frame(logrets_today),
                             (seq(nrow(logrets_today)) - 1) %/% time_window)
cor_mat_today <- lapply(logrets_today.split, cor)
dis_mat_today <- lapply(cor_mat_today, function(x) sqrt(2 * (1 - x)))
logrets_fin.split <- split(as.data.frame(logrets_finance),
                           (seq(nrow(logrets_finance)) - 1) %/% time_window)
cor_mat_fin <- lapply(logrets_fin.split, cor)
dis_mat_fin <- lapply(cor_mat_fin, function(x) sqrt(2 * (1 - x)))
These dissimilarity matrices give perfectly anticorrelated stocks distance 2 and perfectly correlated stocks distance 0, which is the behavior we want here. (Correlation can be negative, so this transformation is needed to turn it into a distance.)
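As a quick sanity check on the transformation d = sqrt(2(1 - ρ)) (a small sketch; the correlation values below are toy inputs, not from our data):

```r
# Check the map from correlation rho to distance sqrt(2 * (1 - rho))
rho <- c(1, 0, -1)        # perfectly correlated, uncorrelated, anticorrelated
d   <- sqrt(2 * (1 - rho))
print(round(d, 3))        # 0 for rho = 1, sqrt(2) ~ 1.414 for rho = 0, 2 for rho = -1
```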
Here I’ll use the TDAstats package to compute persistent homology and plot barcodes. I’ll time the barcode calculation as well.
TS_barcode_function <- function(input) {
  # Vietoris-Rips persistent homology up to dimension 2, from a distance matrix
  calculate_homology(input, format = "distmat", dim = 2)
}
pm <- proc.time()
TS_barcodes_list_today <- lapply(dis_mat_today,TS_barcode_function)
TS_barcodes_list_fin <- lapply(dis_mat_fin,TS_barcode_function)
total_time_TDAstats <- proc.time()-pm
print(total_time_TDAstats)
## user system elapsed
## 0.100 0.002 0.109
Now loop through and plot:
my_plot_function_fin <- function(idx) {
  plot_persist(TS_barcodes_list_fin[[idx]]) +
    ggtitle(widestock_finance$ref.date[idx * time_window]) +
    xlim(c(0, 2)) + ylim(c(0, 2))
}
lapply( seq_along(TS_barcodes_list_fin),my_plot_function_fin)
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
(the same pair of warnings repeats for each of the nine plots; the repetitions are omitted here)
## [[1]] through [[9]]: persistence diagrams, one per 63-day window (plots not reproduced here)
# mapply is giving me trouble today
my_plot_function_today <- function(idx) {
  plot_persist(TS_barcodes_list_today[[idx]]) +
    ggtitle(widestock_today$ref.date[idx * time_window]) +
    xlim(c(0, 2)) + ylim(c(0, 2))
}
lapply(seq_along(TS_barcodes_list_today),my_plot_function_today)
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
(the same pair of warnings repeats for each of the twenty-one plots; the repetitions are omitted here)
## [[1]] through [[21]]: persistence diagrams, one per 63-day window (plots not reproduced here)
I set the axes and titles so that these pictures are directly comparable, but since plot_persist already sets its own scales, ggplot2 warns that my xlim and ylim replace the existing ones; there is probably a cleaner idiom that reflects my lack of experience with the TDAstats package.
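One way to sidestep those scale warnings (a sketch, not necessarily the intended TDAstats idiom) is to zoom with ggplot2’s coord_cartesian, which adjusts the view without adding new scales; the helper name here is hypothetical:

```r
# Hypothetical variant of the plotting helper: coord_cartesian() zooms the
# view without replacing the x/y scales plot_persist already added, so
# ggplot2 emits no "scale is already present" warnings.
my_plot_function_fin2 <- function(idx) {
  plot_persist(TS_barcodes_list_fin[[idx]]) +
    ggtitle(widestock_finance$ref.date[idx * time_window]) +
    coord_cartesian(xlim = c(0, 2), ylim = c(0, 2))
}
```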
Here I’ll use the ‘TDA’ package to compute persistent homology and plot barcodes.
barcode_function <- function(input) {
  # Vietoris-Rips persistent homology up to dimension 3 on an arbitrary distance matrix
  ripsDiag(input, maxscale = 2, maxdimension = 3,
           dist = "arbitrary", library = c("GUDHI", "PHAT"))
}
pm <- proc.time()
barcodes_list_today <- lapply(dis_mat_today,barcode_function)
barcodes_list_fin <- lapply(dis_mat_fin,barcode_function)
total_time_TDA <- proc.time()-pm
print(total_time_TDA)
## user system elapsed
## 34.114 1.908 41.236
for (j in seq(from=1, to=num_time_points_fin)){
# plotting barcodes
plot(barcodes_list_fin[[j]][["diagram"]],barcode = TRUE, main = widestock_finance$ref.date[j*time_window])
}
for (j in seq(from=1, to=num_time_points_today)){
# plotting barcodes
plot(barcodes_list_today[[j]][["diagram"]],barcode = TRUE, main = widestock_today$ref.date[j*time_window])
}
TDAstats is a lot faster, because Ripser is faster!
print(total_time_TDA)
## user system elapsed
## 34.114 1.908 41.236
print(total_time_TDAstats)
## user system elapsed
## 0.100 0.002 0.109
The idea here is based on Marian Gidea’s paper “Topological Data Analysis of Critical Transitions in Financial Networks”, which one can find at https://arxiv.org/abs/1701.06081. With my students at the University of Minnesota, we’re extending this work and have several papers in progress.